Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
Authors
Abstract
Several large-scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2-normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multivariate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity, which has been widely employed by the information retrieval community, and obtains the spherical k-means algorithm (k-means with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
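As a rough illustration of the quantities the abstract mentions, the following sketch estimates a mean direction and a concentration parameter for a single vMF component and assigns a point to a cluster by cosine similarity. It is not the paper's algorithm as stated here: the function names are hypothetical, and the closed-form expression kappa ≈ r_bar (d - r_bar^2) / (1 - r_bar^2) is an assumption brought in from outside the abstract, used as a commonly cited stand-in for the exact inversion of the Bessel-function ratio.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_mean_and_kappa(points):
    """Estimate mean direction and concentration from unit vectors.

    Assumption: kappa uses a closed-form approximation rather than
    numerically inverting the ratio of Bessel functions. It is undefined
    when all points coincide (r_bar == 1).
    """
    n = len(points)
    d = len(points[0])
    # Resultant vector: coordinate-wise sum of all unit vectors.
    resultant = [sum(p[i] for p in points) for i in range(d)]
    length = math.sqrt(sum(x * x for x in resultant))
    mu = [x / length for x in resultant]   # estimated mean direction
    r_bar = length / n                     # mean resultant length, in (0, 1)
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa


def cosine_assign(point, means):
    """Return the index of the mean direction with the highest cosine similarity.

    For unit vectors the cosine similarity is just the dot product.
    """
    sims = [sum(a * b for a, b in zip(point, m)) for m in means]
    return max(range(len(means)), key=sims.__getitem__)


if __name__ == "__main__":
    cluster = [normalize(v) for v in
               ([1, 0.1, 0], [1, -0.1, 0], [1, 0, 0.1], [1, 0, -0.1])]
    mu, kappa = vmf_mean_and_kappa(cluster)
    print("mean direction:", mu)
    print("approximate kappa:", kappa)
```

Tightly clustered points give a mean resultant length close to 1 and therefore a large concentration estimate. Hard assignment by cosine similarity alone, with the concentration ignored, corresponds to the spherical k-means special case the abstract refers to.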
Similar resources
Unscented von Mises-Fisher Filtering
We introduce the Unscented von Mises–Fisher Filter (UvMFF), a nonlinear filtering algorithm for dynamic state estimation on the n-dimensional unit hypersphere. Estimation problems on the unit hypersphere occur in computer vision, for example when using omnidirectional cameras, as well as in signal processing. As approaches in the literature are limited to very simple system and measurement models, ...
Mixture of Watson Distributions: A Generative Model for Hyperspherical Embeddings
Machine learning applications often involve data that can be analyzed as unit vectors on a d-dimensional hypersphere, or equivalently are directional in nature. Spectral clustering techniques generate embeddings that constitute an example of directional data and can result in different shapes on a hypersphere (depending on the original structure). Other examples of directional data include text...
movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions
Finite mixtures of von Mises-Fisher distributions make it possible to apply model-based clustering methods to data of standardized length, i.e., data whose points all lie on the unit sphere. The R package movMF contains functionality to draw samples from finite mixtures of von Mises-Fisher distributions and to fit these models using the expectation-maximization algorithm for maximum likelihood estimati...
Expectation Maximization for Clustering on Hyperspheres
High-dimensional directional data is becoming increasingly important in contemporary applications such as analysis of text and gene-expression data. A natural model for multivariate directional data is provided by the von Mises-Fisher (vMF) distribution on the unit hypersphere, which is analogous to the multivariate Gaussian distribution in R^d. In this paper, we propose modeling complex directional ...
Multitarget tracking with the von Mises-Fisher filter and probabilistic data association
Directional data emerge in many scientific disciplines due to the nature of the observed phenomena or the working principles of a sensor. The problem of tracking with direction-only sensors is challenging since the motion of the target typically resides either in 3D or 2D Euclidean space, while the corresponding measurements reside either on the unit sphere or the unit circle, respectively. Fur...
Journal title:
- Journal of Machine Learning Research
Volume 6, Issue
Pages -
Publication date 2005